Abstract. In this paper, we describe our method for the expert search task in
TREC 2008. First, we propose an adaptation of the language modeling method
for expert search, which considers the probability of query generation separately
for each kind of expert evidence (full name, abbreviated name, and email
address). Current expert search models can be easily integrated into our method.
Our experiments indicate that our method can make use of ambiguous evidence
in expert search (the abbreviated name), which often caused a drop in effectiveness
in other methods. In addition, we used a probabilistic measure to detect phrases
in the query, but it did not improve effectiveness.
1  Introduction

In recent years, much attention has been focused on the task of expert search. Many
effective models have been proposed, all of which make use of the co-occurrence
information around expert evidence. This intuition has proved to be quite effective.

Language modeling methods are widely used in expert search. In TREC 2005,
Cao et al. [1] and Azzopardi et al. [2] introduced two language modeling methods for
the expert search task. These methods were later described by Balog et al. [3] as the
candidate model (model 1) and the document model (model 2). Fang et al. [4] proposed
a similar framework, but they explicitly modeled relevance and adopted the
probability ranking principle to rank experts.

Further, several more specific problems have been studied under the framework proposed
in [3]. Petkova et al. [5] proposed a method to consider the dependency between terms and
candidates using proximity-based measures. Balog et al. [6] elaborated the estimation
of candidate-document association. Serdyukov et al. [7] explored relevance
propagation in expert search. Balog et al. [8] considered non-local information available
in the collection for expert search. For a thorough review, please refer to [9] and [10].

Language modeling methods have also proved effective in environments other than
the TREC collections. For example, Balog et al. [11] created the UvT collection, which
involves web pages with multilingual features from a university. In addition,
Serdyukov et al. [12] and Jiang et al. [13] applied the language modeling method to the
internet environment and verified its effectiveness using search engine results.

This is the third year that our group has participated in the TREC expert search task. In
TREC 2006, we adopted a window-based method for expert search [14]. We first
built a pseudo-profile for each expert using their co-occurrence information in the
documents, and then searched for relevant profiles using text retrieval models. In
TREC 2007, we adopted a simple ranking model [15], in which an expert’s score is
a linear combination of the scores of all the supporting documents. In addition, we
adopted several methods to filter out invalid supporting documents, which can
effectively enhance precision [16].

This year, we mainly focus on the ambiguity of expert evidence in expert search. In
the collection, an expert can be mentioned through several kinds of evidence. Some kinds
of evidence are ambiguous and can denote more than one expert.

We  only  consider  three  main  kinds  of  evidence  here: 

1.  $ev_{fn}$:  full name, e.g. “Jiepu Jiang”;

2.  $ev_{abbr}$:  abbreviated name, e.g. “J. Jiang”, “Jiang”;

3.  $ev_{em}$:  email address, e.g. “jiepu.jiang@gmail.com”.

Generally, $ev_{em}$ is often the most explicit form of evidence, while $ev_{fn}$ and
$ev_{abbr}$ can be ambiguous. The ambiguity depends on the size of the collection. For
example, $ev_{fn}$ is mostly explicit in an enterprise that involves a few thousand
experts, but it can be highly ambiguous over the internet. $ev_{abbr}$ is often highly
ambiguous, even in an enterprise-size collection.

Current models [4] [5] [6] [10] handle the ambiguity by combining several kinds of
evidence with different weights when estimating the candidate-document
associations. However, most experiments indicated that using $ev_{abbr}$ results in a
loss in performance.

In this paper, we focus on making better use of the ambiguous expert evidence, i.e.
$ev_{abbr}$. In contrast to other models, our proposed model estimates $p(e|q)$
separately for each kind of evidence that can denote $e$. In addition, we used an
auxiliary method in our experiments, i.e. phrase detection in the query.

The remainder of this paper is organized as follows: in section 2, we describe
our methods; in section 3, we explain our experiments and submitted runs; in
section 4, we draw conclusions and discuss some future challenges.


2  Modeling  Expert  Evidence  in  Expert  Search 

In  this  section,  we  will  mainly  explain  our  model  for  expert  search,  which  considers 
each  kind  of  evidence  separately. 

From a language modeling perspective, we can rank experts by $p(e|q)$, the
probability of the expert $e$ given the query $q$. Applying Bayes’ rule, $p(e|q)$
can be transformed as in Eq. (1):

$$ p(e|q) = \frac{p(q|e) \times p(e)}{p(q)} \tag{1} $$


In a specific ranking task, $p(q)$ is the same for every $e$ and can be ignored in ranking.
In addition, we simply assume the same prior probability $p(e)$ for each expert $e$ here.
As a result, we can rank experts by $p(q|e)$.

We represent all possible kinds of evidence for $e$ as a set $EV_e = \{ev_i\}$, in which
$ev_i$ can be any of $ev_{fn}$, $ev_{abbr}$, or $ev_{em}$ that can denote $e$. The whole
event space can then be partitioned into subsets, one for each $ev_i$. By the law of total
probability, we can transform $p(q|e)$ as in Eq. (2):

$$ p(q|e) = \sum_{ev_i \in EV_e} p(q|ev_i, e) \times p(ev_i|e) \tag{2} $$

In Eq. (2), $p(ev_i|e)$ is the probability that $e$ appears in the form of $ev_i$, which
can be interpreted as a kind of expert-evidence association, and $p(q|ev_i,e)$ is the
probability that $e$ generates $q$ when it appears in the form of $ev_i$, which can be
interpreted as the evidence-topic association.
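
As a minimal sketch of Eq. (2), the mixture over evidence kinds can be written as a weighted sum. The function name, the dictionary keys, and all numbers below are hypothetical and serve only to illustrate the arithmetic; they are not values from our experiments.

```python
def query_likelihood(evidence_weights, evidence_query_probs):
    """Combine per-evidence query likelihoods as in Eq. (2).

    evidence_weights:     maps an evidence kind to p(ev_i | e)
    evidence_query_probs: maps an evidence kind to p(q | ev_i, e)
    """
    return sum(
        weight * evidence_query_probs.get(kind, 0.0)
        for kind, weight in evidence_weights.items()
    )


# Hypothetical values for a single expert and a single query.
weights = {"fn": 0.6, "abbr": 0.3, "em": 0.1}          # p(ev_i | e)
query_probs = {"fn": 0.02, "abbr": 0.01, "em": 0.05}   # p(q | ev_i, e)
score = query_likelihood(weights, query_probs)
```

Because the weights sum to one, an ambiguous kind of evidence contributes only in proportion to how often the expert actually appears in that form.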

In the rest of this section, we further explain our estimation of $p(ev_i|e)$ and
$p(q|ev_i,e)$. For simplicity, we adopt the following assumption here:

Assumption 1: $ev_{fn}$ and $ev_{em}$ are explicit, and only $ev_{abbr}$ can be ambiguous.

Under assumption 1, we consider only the ambiguity of $ev_{abbr}$. The ambiguity
of $ev_{fn}$ is ignored here, since it is rare in an enterprise collection for two persons
to have the same full name.


2.1  Expert-Evidence  Association 


The association between expert and evidence is measured as $p(ev_i|e)$, the probability
that $e$ appears in the form of $ev_i$. In our model, we estimate it by maximum
likelihood estimation as in Eq. (3):


$$ p(ev_i|e) = \frac{tf(ev_i, e)}{\sum_{ev_j \in EV_e} tf(ev_j, e)} \tag{3} $$


In Eq. (3), $tf(ev_i,e)$ is the number of occurrences of $ev_i$ that denote $e$ in the
whole collection, and $EV_e$ is the set of all possible kinds of evidence that can denote $e$.
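
The maximum likelihood estimate in Eq. (3) amounts to normalizing the per-kind frequencies of one expert. The sketch below assumes the frequencies are already available; the function name and the counts are hypothetical.

```python
def expert_evidence_association(tf_by_kind):
    """Estimate p(ev_i | e) for one expert as in Eq. (3).

    tf_by_kind maps an evidence kind to tf(ev_i, e), the number of
    occurrences of that kind of evidence that denote the expert.
    """
    total = sum(tf_by_kind.values())
    if total == 0:
        # The expert is never mentioned: no evidence kind gets any mass.
        return {kind: 0.0 for kind in tf_by_kind}
    return {kind: tf / total for kind, tf in tf_by_kind.items()}


# Hypothetical counts for one expert.
association = expert_evidence_association({"fn": 30, "abbr": 15, "em": 5})
```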

For $tf(ev_{fn},e)$ and $tf(ev_{em},e)$, we can simply count them as $tf(ev_{fn})$ and
$tf(ev_{em})$, the frequencies of $ev_{fn}$ and $ev_{em}$ in the whole collection, since
they are assumed to denote $e$ explicitly here (assumption 1).

The remaining question is how to estimate $tf(ev_{abbr},e)$, the number of occurrences
of $ev_{abbr}$ that denote $e$ in the whole collection. The problem is that $ev_{abbr}$ may
denote more than one expert and the disambiguation cannot be perfect. To address this, we
first adopt a rule that can disambiguate $ev_{abbr}$ with certainty in a part of the
collection, and then use the distribution of $ev_{abbr}$ in the disambiguated part to infer
its distribution in the whole collection.

First, though we cannot disambiguate $ev_{abbr}$ in all the documents of the collection,
we can disambiguate $ev_{abbr}$ in the documents that contain a related expert’s $ev_{fn}$
or $ev_{em}$. For example, if “Jiepu Jiang” is mentioned in a document, we know that
“J. Jiang” in the same document refers to “Jiepu Jiang”. Similarly, if “Jay Jiang” is
mentioned, we know that “J. Jiang” in the document is not “Jiepu Jiang”. For simplicity,
we do not consider the case in which both “Jiepu Jiang” and “Jay Jiang” appear in a
document. Using this method, we can disambiguate $ev_{abbr}$ in a part of the collection.
We denote $ev_{abbr}$ in the disambiguated part of the collection as $ev'_{abbr}$.
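
The disambiguation rule above can be sketched as follows. This is an illustration under stated assumptions rather than our actual implementation: explicit evidence is matched by plain substring search, the function name and expert identifiers are hypothetical, and a document containing explicit evidence for more than one candidate is simply left unresolved, mirroring the simplification described above.

```python
def resolve_abbreviation(doc_text, candidates):
    """Decide which expert an abbreviated name denotes in one document.

    candidates maps an expert id to that expert's explicit evidence
    strings (full name, email address); all candidates are assumed to
    share the same abbreviated name. Returns the expert id when exactly
    one candidate has explicit evidence in the document, otherwise None.
    """
    present = [
        expert
        for expert, explicit_forms in candidates.items()
        if any(form in doc_text for form in explicit_forms)
    ]
    return present[0] if len(present) == 1 else None


candidates = {
    "jiepu": ["Jiepu Jiang", "jiepu.jiang@gmail.com"],
    "jay": ["Jay Jiang"],
}
resolved = resolve_abbreviation(
    "Jiepu Jiang gave a talk. J. Jiang then answered questions.", candidates
)
unresolved = resolve_abbreviation("J. Jiang sent the report.", candidates)
```

Only the documents for which the function returns an expert id belong to the disambiguated part of the collection.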

Then, we can transform $tf(ev_{abbr},e)$ as in Eq. (4):

$$ tf(ev_{abbr}, e) = tf(ev_{abbr}) \times p(e|ev_{abbr}) \tag{4} $$

In Eq. (4), $tf(ev_{abbr})$ is the frequency of $ev_{abbr}$ in the whole collection (whether
or not it denotes $e$), and $p(e|ev_{abbr})$ is the proportion of $ev_{abbr}$ occurrences
that denote $e$ in the whole collection.

Further, assuming that whether an occurrence of $ev_{abbr}$ denotes $e$ is independent of
whether that occurrence can be disambiguated, we can estimate $p(e|ev_{abbr})$ as in Eq. (5):

$$ p(e|ev_{abbr}) \approx p(e|ev'_{abbr}) = \frac{tf(ev'_{abbr}, e)}{tf(ev'_{abbr})} \tag{5} $$

In Eq. (5), $ev'_{abbr}$ refers to $ev_{abbr}$ occurrences that can be disambiguated, and
$p(e|ev'_{abbr})$ is the proportion of $ev_{abbr}$ occurrences that denote $e$ in the
disambiguated part of the collection, which can be estimated by the right-hand side of Eq. (5).
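
Eq. (4) and Eq. (5) together extrapolate from the disambiguated part to the whole collection. The sketch below assumes that the proportion in Eq. (5) is computed as a ratio of occurrence counts within the disambiguated part; the function name, parameter names, and counts are hypothetical.

```python
def estimate_abbr_tf(tf_abbr_total, tf_resolved_to_expert, tf_resolved_total):
    """Estimate tf(ev_abbr, e) from Eq. (4) and Eq. (5).

    tf_abbr_total:         occurrences of the abbreviation in the whole collection
    tf_resolved_to_expert: disambiguated occurrences that denote the expert
    tf_resolved_total:     all disambiguated occurrences of the abbreviation
    """
    if tf_resolved_total == 0:
        # Nothing could be disambiguated, so no occurrence is attributed.
        return 0.0
    p_expert_given_abbr = tf_resolved_to_expert / tf_resolved_total  # Eq. (5)
    return tf_abbr_total * p_expert_given_abbr                       # Eq. (4)


# Hypothetical counts: 200 occurrences overall, 40 disambiguated, 30 of those denote e.
estimated_tf = estimate_abbr_tf(200, 30, 40)
```

With these counts, three quarters of the disambiguated occurrences denote the expert, so three quarters of all 200 occurrences are attributed to the expert.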


2.2  Evidence-Topic  Association 

The association between evidence and the topic, i.e. $p(q|ev_i,e)$, is the probability
that $e$ generates $q$ when it appears in the form of $ev_i$. From another perspective, it
can also be interpreted as the probability that $ev_i$ generates $q$ when it denotes $e$.
We adopt the latter interpretation, since it can be easily related to previous language
modeling methods for expert search.

$p(q|ev_{fn},e)$ and $p(q|ev_{em},e)$ can be estimated directly as $p(q|ev_{fn})$ and
$p(q|ev_{em})$, since $ev_{fn}$ and $ev_{em}$ are assumed to be explicit (assumption 1). The
estimation of $p(q|ev_{fn})$ and $p(q|ev_{em})$ is quite similar to previous models for
expert search.

For $ev_{abbr}$, we can also adopt the intuition in section 2.1. First, we disambiguate
$ev_{abbr}$ in a part of the collection. Then, we estimate $p(q|ev_{abbr},e)$ as in Eq. (6):

$$ p(q|ev_{abbr}, e) \approx p(q|ev'_{abbr}, e) = p(q|ev'_{abbr-e}) \tag{6} $$

In Eq. (6), $p(q|ev'_{abbr},e)$ is the probability of query generation for $ev_{abbr}$
occurrences that denote $e$ in the disambiguated part of the collection. For simplicity,
we use $ev'_{abbr-e}$ to represent $ev_{abbr}$ occurrences that denote $e$ in the
disambiguated part of the collection.

Further, we can estimate the probability of query generation for each of $ev_{fn}$,
$ev_{em}$, and $ev'_{abbr-e}$ using previous language modeling methods for expert search.
In this paper, we adopted only one frequently cited method, i.e. model 2 in [3], as in
Eq. (7), Eq. (8), and Eq. (9):

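$$ p(q|ev_i) = \sum_{d_j \in D} p(q|d_j, ev_i) \times p(d_j|ev_i) \tag{7} $$

$$ p(q|d, ev_i) \approx p(q|d) = \prod_{t \in q} p(t|d)^{n(t,q)} = \prod_{t \in q} \left( (1-\lambda) \, p_{ml}(t|d) + \lambda \, p_c(t) \right)^{n(t,q)} \tag{8} $$

$$ p(d|ev_i) \propto p(ev_i|d) = \frac{tf(ev_i, d)}{\sum_{j} tf(ev_j, d)} \tag{9} $$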


In Eq. (8), $n(t,q)$ is the frequency of term $t$ in $q$, $p_{ml}(t|d)$ is the maximum
likelihood estimate of the probability of $t$ in $d$, $p_c(t)$ is the probability of $t$ in
the whole collection, and $\lambda$ is a smoothing parameter, which is set to a constant
0.5 here. In Eq. (9), $tf(ev_i,d)$ is the frequency of $ev_i$ in $d$.
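
Eq. (7) to Eq. (9) can be sketched as below. This is a simplified illustration and not the code used for our runs: documents are plain token lists, the collection model is a dictionary of term probabilities, the sum in the denominator of Eq. (9) is assumed to run over all evidence found in the document, and all names and toy data are hypothetical.

```python
import math
from collections import Counter


def smoothed_query_likelihood(query_terms, doc_terms, collection_prob, lam=0.5):
    """p(q | d) as in Eq. (8): linear interpolation with the collection model."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_p = 0.0
    for term, n_tq in Counter(query_terms).items():
        p_ml = doc_counts[term] / doc_len if doc_len else 0.0
        p = (1 - lam) * p_ml + lam * collection_prob.get(term, 0.0)
        if p == 0.0:
            return 0.0  # the term is unseen even in the collection model
        log_p += n_tq * math.log(p)
    return math.exp(log_p)


def evidence_query_likelihood(evidence, query_terms, docs, evidence_tf,
                              collection_prob, lam=0.5):
    """p(q | ev_i) as in Eq. (7), with p(d | ev_i) taken from Eq. (9).

    docs:        maps a document id to its list of terms
    evidence_tf: maps a document id to {evidence string: tf in that document}
    """
    score = 0.0
    for doc_id, doc_terms in docs.items():
        tf_in_doc = evidence_tf.get(doc_id, {})
        total = sum(tf_in_doc.values())
        if total == 0 or tf_in_doc.get(evidence, 0) == 0:
            continue  # the evidence does not occur in this document
        p_ev_given_d = tf_in_doc[evidence] / total  # Eq. (9), up to proportionality
        p_q_given_d = smoothed_query_likelihood(
            query_terms, doc_terms, collection_prob, lam
        )
        score += p_q_given_d * p_ev_given_d
    return score


# Toy data for illustration only.
docs = {"d1": ["expert", "search", "model"], "d2": ["language", "model"]}
collection_prob = {"expert": 0.2, "search": 0.2, "model": 0.4, "language": 0.2}
evidence_tf = {"d1": {"Jiepu Jiang": 2, "Jay Jiang": 2}, "d2": {"Jiepu Jiang": 1}}
score = evidence_query_likelihood(
    "Jiepu Jiang", ["expert", "search"], docs, evidence_tf, collection_prob
)
```

Working in log space inside `smoothed_query_likelihood` is a design choice of this sketch that avoids underflow for long queries; it does not change the value defined by Eq. (8).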

In fact, we can also adopt other language modeling methods for expert search whose
framework is similar to [3], e.g. [5][6][8].


2.3  Other  Methods 

Some auxiliary methods have been studied in expert search to enhance effectiveness,
e.g. automatic detection of expert home pages, use of HTML structure, use of the link
structure in the collection, etc.

We also tested an auxiliary method for expert search: automatic detection of phrases in
the query. For two adjacent terms in the query, $t_i$ and $t_j$, we use
$p(t_i t_j|t_i,t_j)$ to determine whether $t_i t_j$ is a phrase; this is the probability
that $t_i$ and $t_j$ are adjacent when both of them appear in a document. We then apply
a threshold to $p(t_i t_j|t_i,t_j)$ to filter out term pairs that are not closely connected.
Our previous experiments indicated that this method is beneficial on the TREC 2007
collection.
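
One possible way to estimate the adjacency probability is sketched below. It assumes a document-level estimate (the fraction of documents containing both terms in which the two terms occur next to each other, in query order). The threshold value is not specified in this paper, so the value used here is purely hypothetical, as are the function names and the toy documents.

```python
def adjacency_probability(t_i, t_j, docs):
    """Fraction of documents containing both terms in which t_i is
    immediately followed by t_j (an assumed document-level estimate)."""
    both = 0
    adjacent = 0
    for terms in docs:
        if t_i in terms and t_j in terms:
            both += 1
            if any(a == t_i and b == t_j for a, b in zip(terms, terms[1:])):
                adjacent += 1
    return adjacent / both if both else 0.0


def detect_phrases(query_terms, docs, threshold):
    """Return the adjacent query term pairs whose adjacency probability
    reaches the threshold."""
    return [
        (a, b)
        for a, b in zip(query_terms, query_terms[1:])
        if adjacency_probability(a, b, docs) >= threshold
    ]


# Toy documents for illustration only.
docs = [
    ["language", "model", "for", "expert", "search"],
    ["expert", "finding", "and", "search"],
    ["expert", "search", "task"],
    ["task", "about", "search"],
]
phrases = detect_phrases(["expert", "search", "task"], docs, threshold=0.6)
```

In this toy collection, “expert search” is adjacent in two of the three documents that contain both terms and is kept as a phrase, while “search task” is adjacent in only one of two and is filtered out.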


3  Evaluation 

We submitted four runs this year to test the methods proposed in section 2.

First, we submitted a group of runs to test whether our method of considering each
kind of expert evidence separately is beneficial:

WHU08BASE: it mostly adopted model 2 in [3], but we simplified $p(d|c)$, the
candidate-document association. We simply set $p(d|c)$ to 1 if $c$ appears in $d$, since
our previous experiments indicated that this simplification can enhance effectiveness. We
used a combination of both $ev_{fn}$ and $ev_{em}$ in this run. $ev_{abbr}$ was ignored
since it reduced the effectiveness in our experiments. Phrase detection was used.

WHU08CAN: it adopted model 2 in [3], with $p(d|c)$ estimated as in Eq. (9). We
used a combination of both $ev_{fn}$ and $ev_{em}$ in this run. $ev_{abbr}$ was ignored
since it reduced the effectiveness in our experiments. Phrase detection was used.

WHU08RFCAN: it adopted our proposed method, which considers each kind of $ev_i$
separately. We used $ev_{fn}$, $ev_{em}$, and $ev_{abbr}$ in this run. Phrase detection
was used.

Table 1 gives an overview of the evaluation results for the three runs. Our proposed
method, i.e. WHU08RFCAN, outperformed WHU08CAN on all reported measures.


Table 1.  A comparison of previous models and our proposed method.

Runs  MAP  P@5  P@10  R-prec  recip-rank
WHU08BASE  0.4255  0.3509  0.3389  0.6563
WHU08CAN  0.4509  0.3345  0.3484  0.6296
WHU08RFCAN  0.4909  0.3455  0.3579  0.6884

In addition, we tested the phrase detection method:

WHU08NOPHR: it adopted the method in WHU08BASE, but did not use phrase
detection.

However, our experiments indicated that phrase detection did not achieve better results
in TREC 2008 (WHU08BASE scored lower than WHU08NOPHR on every measure reported in
Table 2), although our previous experiments on TREC 2007 queries showed it to be
beneficial.


Table 2.  Evaluation results for the phrase detection method.

Runs  MAP  P@5  P@10  R-prec  recip-rank
WHU08BASE  0.4255  0.3509  0.3389  0.6563
WHU08NOPHR  0.4909  0.3655  0.3665  0.6770


4  Conclusion 

In this paper, we described the methods we adopted in the TREC 2008 expert search task.

First, we proposed an adaptation of the language modeling method for expert search,
which considers the probability of query generation separately for different kinds of
expert evidence (full name, abbreviated name, and email). Our experiments indicated
that this method can effectively make use of ambiguous evidence in expert search
(e.g. the abbreviated name). Previous expert search models can be easily integrated into
the method proposed here. In addition, we adopted a phrase detection method in our
experiments, but it failed to improve the results.

In the future, we plan to further investigate the proposed method that considers
each kind of evidence separately. First, we adopted only the most frequently used method
for the estimation of $p(q|ev_i)$; other methods should be tested as well. Second, in
Eq. (3), Eq. (4), Eq. (5), and Eq. (6), we used only simple maximum likelihood
estimation, which can be refined in the future.